Allow truncation when embedding #14493
base: master
Conversation
@ngxson Please check
This can be an intermediate solution for now, but I think the whole embedding system needs to be reworked at some point.
Currently, we don't have to process the whole input in one single batch if we're using last pooling with causal attention. Having this would be useful for newer models like Qwen, while also unlocking the use of a single model for both text generation and embeddings. @ggerganov WDYT?
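To illustrate the point about last pooling: since the embedding is just the final token's hidden state, a long prompt can be decoded in several causal batches without keeping the earlier states. This is only a minimal sketch of that idea; `decode_batch` and `embed_last_pooling` are hypothetical stand-ins, not llama.cpp API:

```cpp
#include <algorithm>
#include <vector>

// hypothetical stand-in for one decode step returning per-token hidden states
static std::vector<float> decode_batch(const std::vector<int> & tokens) {
    std::vector<float> hidden(tokens.size());
    for (size_t i = 0; i < tokens.size(); ++i) {
        hidden[i] = 0.1f * tokens[i]; // dummy "hidden state" for illustration
    }
    return hidden;
}

// process the prompt in fixed-size causal batches; with *last* pooling only
// the final token's hidden state is needed, so nothing else must be retained
// across batches (mean pooling, by contrast, needs every token's state)
static float embed_last_pooling(const std::vector<int> & prompt, size_t n_batch) {
    float last = 0.0f;
    for (size_t i = 0; i < prompt.size(); i += n_batch) {
        const size_t end = std::min(prompt.size(), i + n_batch);
        std::vector<int> batch(prompt.begin() + i, prompt.begin() + end);
        last = decode_batch(batch).back(); // keep only the last token's state
    }
    return last;
}
```

Splitting the prompt like this would let inputs larger than the batch size be embedded instead of rejected, which is what makes this attractive for causal models used for both generation and embeddings.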
// Note: If the input was truncated (slot.truncated == true), this embedding
// represents only the processed portion of the original input
for (int i = 0; i < batch.n_tokens; ++i) {
An easier way to avoid touching this loop - which is becoming more and more fragile - is to simply handle this truncation in launch_slot_with_task. You can truncate slot.prompt_tokens right after it's been std::moved.
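A minimal sketch of that suggestion follows. The struct and function here are simplified stand-ins assumed from the discussion, not the actual server code; only the field names slot.prompt_tokens and slot.truncated come from this thread:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// simplified stand-in for the server slot (real slot has many more fields)
struct slot_t {
    std::vector<int> prompt_tokens;
    size_t n_ctx = 4; // per-slot context size
    bool truncated = false;
};

// truncate the prompt right after it is moved into the slot, keeping the
// first n_ctx tokens and flagging the slot so the response can report it
static void launch_slot_with_task(slot_t & slot, std::vector<int> && tokens) {
    slot.prompt_tokens = std::move(tokens);
    if (slot.prompt_tokens.size() > slot.n_ctx) {
        slot.prompt_tokens.resize(slot.n_ctx); // drop the tail
        slot.truncated = true;                 // surface this to the client
    }
}
```

Doing the truncation at launch time keeps the batching loop unaware of truncation entirely, which is the point of the suggestion.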
I agree that the embeddings handling needs to be reworked. This change specifically does not seem to improve the situation - we introduce another special case with an extra parameter and branches. Better to implement this properly later.
Sometimes it frustrates me that llama-server automatically stops when slot.n_ctx < input token length in an embedding task. I want it to be able to truncate the input tokens, as an option.